{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ad05363f",
   "metadata": {},
   "source": [
    "## Key2\n",
    "\n",
    "对于病理的任务，将病理的所有tiles转化成histogram或者tfidf的patient特征。\n",
    "\n",
    "![](http://www.medai.icu/storage/attachments/2022/06/26/n41q4HeDjvIOZnyfoKH28c5YNioGdB7OZwO35XOf.png)\n",
    "\n",
    "参考论文: [Development and interpretation of a pathomics-based model for the prediction of microsatellite instability in Colorectal Cancer](http://www.medai.icu/download?url=http://www.medai.icu/apiv3/attachment.download?sign=1667478d908313ae1e01543e229d02de&attachmentsId=1061&threadId=230)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2025640",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from onekey_algo.custom.utils import key2\n",
    "import numpy as np\n",
    "import os\n",
    "\n",
    "log = pd.read_csv('C:/Users/onekey/Desktop/onekey_comp/comp4-What（分类识别）/20220710/resnet18/viz/BST_RESULTS.txt', sep='\\t',\n",
    "                 names=['fname', 'prob', 'pred', 'gt'])\n",
    "log['prob'] = list(map(lambda x: x[0] if x[1] == 1 else 1-x[0], np.array(log[['prob', 'pred']])))\n",
    "log[['group']] = log[['fname']].applymap(lambda x: os.path.basename(x).split('_')[0])\n",
    "log['prob'] = log['prob'].round(decimals=2)\n",
    "log.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66518213",
   "metadata": {},
   "source": [
    "### 直方图\n",
    "\n",
    "```python\n",
    "def key2histogram(data: pd.DataFrame, group_column: str, histo_columns: Union[str, List[str]],\n",
    "                  histo_lists: Union[list, List[list]] = None, default_value=0, norm: bool = False):\n",
    "    \"\"\"\n",
    "    所有的数据生成直方图特征， 多个histo_columns存在是，所有的特征进行横向拼接。\n",
    "    Args:\n",
    "        data: 数据\n",
    "        group_column: 样本分组的列明，一般为ID\n",
    "        histo_columns: 用来计算直方图的列，如果为多列，则每列计算完直方图，然后特征拼接\n",
    "        histo_lists: None或者与histo_columns个数相同，为自己指定特征列表\n",
    "        default_value: 不存在特征时的默认值\n",
    "        norm: 要不要归一化。\n",
    "    Returns:\n",
    "\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1723ec3",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = key2.key2histogram(log, group_column='group',histo_columns='prob', norm=True)\n",
    "# results.to_csv('histogram.csv', header=True, index=False)\n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9fab1426",
   "metadata": {},
   "source": [
    "### TF-IDF\n",
    "\n",
    "```python\n",
    "def key2tfidf(data: pd.DataFrame, group_column: str, corpus_columns: Union[str, List[str]]):\n",
    "    \"\"\"\n",
    "    所有的数据生成直方图特征， 多个corpus_columns存在是，所有的特征进行横向拼接。\n",
    "    Args:\n",
    "        data: 数据\n",
    "        group_column: 样本分组的列明，一般为ID\n",
    "        corpus_columns: 用来计算作为语料的列明。\n",
    "    Returns:\n",
    "\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc562191",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = key2.key2tfidf(log, group_column='group',corpus_columns='prob')\n",
    "# results.to_csv('tfidf.csv', header=True, index=False)\n",
    "results"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
